10. Convolutional Neural Networks II

Large Convolutional Networks

There are several Convolutional Network architectures that have become well known by name. The most common are:

  • LeNet. The first successful applications of Convolutional Networks, developed by Yann LeCun in the 1990s and used, among other things, to read digits and zip codes.
  • AlexNet. The network by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton that won ILSVRC 2012 and popularized Convolutional Networks in computer vision.
    (Source: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
  • ZF Net. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.
    (Source: https://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf)
  • VGGNet. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the VGGNet. Its main contribution was showing that the depth of the network is a critical component of good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from beginning to end. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters. Most of these parameters are in the first fully connected layer, and it has since been found that these FC layers can be removed with no drop in performance, significantly reducing the number of necessary parameters.
(Source: https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/)

In [7]:
# Small VGG-like convnet in Keras

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD

# Generate dummy data

def to_categorical(y, num_classes=None):
    """
    Converts a class vector (integers) to binary class matrix.
    """
    y = np.array(y, dtype='int').ravel()
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    categorical[np.arange(n), y] = 1
    return categorical

x_train = np.random.random((100, 100, 100, 3))
y_train = to_categorical(np.random.randint(10, size=(100, 1)), num_classes=10)
x_test = np.random.random((20, 100, 100, 3))
y_test = to_categorical(np.random.randint(10, size=(20, 1)), num_classes=10)

model = Sequential()
# input: 100x100 images with 3 channels -> (100, 100, 3) tensors.
# this applies 32 convolution filters of size 3x3 each.
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 3)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
print(model.summary())

model.fit(x_train, y_train, batch_size=32, epochs=10)
score = model.evaluate(x_test, y_test, batch_size=32)


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
convolution2d_1 (Convolution2D)  (None, 98, 98, 32)    896         convolution2d_input_1[0][0]      
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)  (None, 96, 96, 32)    9248        convolution2d_1[0][0]            
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)    (None, 48, 48, 32)    0           convolution2d_2[0][0]            
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 48, 48, 32)    0           maxpooling2d_1[0][0]             
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)  (None, 46, 46, 64)    18496       dropout_3[0][0]                  
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)  (None, 44, 44, 64)    36928       convolution2d_3[0][0]            
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 22, 22, 64)    0           convolution2d_4[0][0]            
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 22, 22, 64)    0           maxpooling2d_2[0][0]             
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 30976)         0           dropout_4[0][0]                  
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 256)           7930112     flatten_2[0][0]                  
____________________________________________________________________________________________________
dropout_5 (Dropout)              (None, 256)           0           dense_3[0][0]                    
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 10)            2570        dropout_5[0][0]                  
====================================================================================================
Total params: 7998250
____________________________________________________________________________________________________
None
Epoch 1/10
100/100 [==============================] - 3s - loss: 2.3024     
Epoch 2/10
100/100 [==============================] - 3s - loss: 2.2982     
Epoch 3/10
100/100 [==============================] - 3s - loss: 2.2872     
Epoch 4/10
100/100 [==============================] - 3s - loss: 2.2710     
Epoch 5/10
100/100 [==============================] - 3s - loss: 2.2566     
Epoch 6/10
100/100 [==============================] - 3s - loss: 2.2378     
Epoch 7/10
100/100 [==============================] - 3s - loss: 2.2409     
Epoch 8/10
100/100 [==============================] - 3s - loss: 2.2539     
Epoch 9/10
100/100 [==============================] - 3s - loss: 2.2608     
Epoch 10/10
100/100 [==============================] - 3s - loss: 2.2750     
20/20 [==============================] - 0s

Exercise

  • Why do we have 896 parameters in the convolution2d_1 layer of the previous example?

  • Compute the number of parameters of the original VGG16 (all CONV layers are 3x3).

    The VGG16 architecture is: INPUT: [224x224x3] $\rightarrow$ CONV3-64: [224x224x64] $\rightarrow$ CONV3-64: [224x224x64] $\rightarrow$ POOL2: [112x112x64] $\rightarrow$ CONV3-128: [112x112x128] $\rightarrow$ CONV3-128: [112x112x128] $\rightarrow$ POOL2: [56x56x128] $\rightarrow$ CONV3-256: [56x56x256] $\rightarrow$ CONV3-256: [56x56x256] $\rightarrow$ CONV3-256: [56x56x256] $\rightarrow$ POOL2: [28x28x256] $\rightarrow$ CONV3-512: [28x28x512] $\rightarrow$ CONV3-512: [28x28x512] $\rightarrow$ CONV3-512: [28x28x512] $\rightarrow$ POOL2: [14x14x512] $\rightarrow$ CONV3-512: [14x14x512] $\rightarrow$ CONV3-512: [14x14x512] $\rightarrow$ CONV3-512: [14x14x512] $\rightarrow$ POOL2: [7x7x512] $\rightarrow$ FC: [1x1x4096] $\rightarrow$ FC: [1x1x4096] $\rightarrow$ FC: [1x1x1000].

  • The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. What is the necessary memory size (supposing that we need 4 bytes for each element) to store intermediate data?


In [ ]:
# your code here
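As a check for the exercises above, a small helper along these lines can be handy (our own sketch, not a complete solution; the function names are ours):

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    # weights (kernel_h * kernel_w * in_channels per filter) plus one bias per filter
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

def fc_params(n_inputs, n_outputs):
    # weights plus one bias per output neuron
    return n_inputs * n_outputs + n_outputs

# e.g. the first layer of the small VGG-like net above:
print(conv_params(3, 3, 3, 32))   # 896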

More Large Convolutional Networks

  • GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much. There are also several follow-up versions of the GoogLeNet, most recently Inception-v4.
GoogLeNet Architecture. Source: https://arxiv.org/pdf/1409.4842v1.pdf

Blue Box: Convolution | Red Box: Pooling | Yellow Box: Softmax | Green Box: Normalization

Inception Layer. Source: https://arxiv.org/pdf/1409.4842v1.pdf
GoogLeNet parameters and ops. Source: https://arxiv.org/pdf/1409.4842v1.pdf

What is the role of 1x1 convolutions?
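One way to see their role (a minimal sketch of our own, not code from the paper): a 1x1 convolution acts as a per-position fully connected layer across channels, so it can cheaply reduce the channel depth before the expensive 3x3 and 5x5 convolutions of an Inception module.

from keras.layers import Input, Conv2D, concatenate
from keras.models import Model

x = Input(shape=(28, 28, 256))

# 1x1 convolutions shrink 256 channels to 64 before the larger filters
branch3 = Conv2D(64, (1, 1), padding='same', activation='relu')(x)
branch3 = Conv2D(128, (3, 3), padding='same', activation='relu')(branch3)

branch5 = Conv2D(64, (1, 1), padding='same', activation='relu')(x)
branch5 = Conv2D(32, (5, 5), padding='same', activation='relu')(branch5)

y = concatenate([branch3, branch5])
Model(x, y).summary()  # compare the parameter count with and without the 1x1 layers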

  • ResNet. The Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special skip connections and heavy use of batch normalization. A Residual Network, or ResNet, is a neural network architecture that addresses the problem of vanishing gradients in the simplest way possible: if there is trouble sending the gradient signal backwards, provide the network with a shortcut at each layer so the signal can flow more smoothly. The architecture also omits the large fully connected layers at the end of the network.
(Source: https://arxiv.org/pdf/1512.03385.pdf)

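In formula form, a residual unit computes $y = F(x) + x$, so the stacked layers only need to learn the residual $F(x)$. A minimal sketch of such a block in Keras (our own illustration, not the original ResNet code) might look like this:

from keras.layers import Input, Conv2D, BatchNormalization, Activation, add
from keras.models import Model

def residual_block(x, filters):
    shortcut = x
    y = Conv2D(filters, (3, 3), padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    y = add([y, shortcut])        # the skip connection: y = F(x) + x
    return Activation('relu')(y)

inputs = Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)
Model(inputs, outputs).summary()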

Deeper is better?

When it comes to neural network design, the trend in the past few years has pointed in one direction: deeper.

Whereas the state of the art only a few years ago consisted of networks which were roughly twelve layers deep, it is now not surprising to come across networks which are hundreds of layers deep.

This move hasn’t just been about greater depth for depth’s sake. For many applications, the most prominent of which is object classification, the deeper the neural network, the better the performance.

So the problem is to design a network in which the gradient can easily reach all the layers of a network that might be dozens, or even hundreds, of layers deep. This is the goal behind some of the state-of-the-art architectures: ResNets, HighwayNets, and DenseNets.

The Highway Network builds on the ResNet in a pretty intuitive way: it preserves the shortcuts introduced in the ResNet, but augments them with a learnable gate that determines to what extent each layer should act as a skip connection or as a nonlinear transformation. Layers in a Highway Network are defined as follows:

$$ y = H(x, W_H) \cdot T(x,W_T) + x \cdot C(x, W_C) $$

In this equation we can see an outline of two kinds of layers discussed: $y = H(x,W_H)$ mirrors the traditional layer, and $y = H(x,W_H) + x$ mirrors our residual unit.

The traditional layer can be implemented as:

import tensorflow as tf

def dense(x, input_size, output_size, activation):
  # a plain fully connected layer: y = activation(W x + b)
  W = tf.Variable(tf.truncated_normal([input_size, output_size], stddev=0.1), name="weight")
  b = tf.Variable(tf.constant(0.1, shape=[output_size]), name="bias")
  y = activation(tf.matmul(x, W) + b)
  return y

What is new are $T(x,W_T)$, the transform gate function, and $C(x,W_C) = 1 - T(x,W_T)$, the carry gate function. When the transform gate is 1, we pass through our activation (H) and suppress the carry gate (since it will be 0). When the carry gate is 1, we pass through the unmodified input (x), while the activation is suppressed.

def highway(x, size, activation, carry_bias=-1.0):
  W_T = tf.Variable(tf.truncated_normal([size, size], stddev=0.1), name="weight_transform")
  b_T = tf.Variable(tf.constant(carry_bias, shape=[size]), name="bias_transform")

  W = tf.Variable(tf.truncated_normal([size, size], stddev=0.1), name="weight")
  b = tf.Variable(tf.constant(0.1, shape=[size]), name="bias")

  T = tf.sigmoid(tf.matmul(x, W_T) + b_T, name="transform_gate")
  H = activation(tf.matmul(x, W) + b, name="activation")
  C = tf.subtract(1.0, T, name="carry_gate")  # C = 1 - T

  y = tf.add(tf.multiply(H, T), tf.multiply(x, C), name="y")
  return y

With this kind of network you can train models with hundreds of layers.

DenseNet takes the insight of the skip connection to the extreme. The idea here is that if connecting a skip connection from the previous layer improves performance, why not connect every layer to every other layer? That way there is always a direct route for the information backwards through the network.

(Source: https://arxiv.org/abs/1608.06993)

Instead of using an addition, however, the DenseNet relies on concatenating (stacking) the feature maps of all preceding layers. Mathematically, layer $\ell$ computes

$$ x_\ell = H_\ell\big([x_0, x_1, \dots, x_{\ell-1}]\big) $$

where $[x_0, x_1, \dots, x_{\ell-1}]$ denotes the concatenation of the feature maps produced by layers $0, \dots, \ell-1$.

This architecture makes intuitive sense in both the feed-forward and the backward settings. In the feed-forward setting, a task may benefit from being able to use low-level feature activations in addition to high-level feature activations. In classifying objects, for example, a lower layer of the network may detect edges in an image, whereas a higher layer detects larger-scale features such as the presence of faces. There may be cases where being able to use information about edges helps in determining the correct object in a complex scene. In the backward case, having all the layers connected allows gradients to flow directly to every layer of the network.
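A minimal sketch of a dense block (our own illustration, not the official DenseNet code): every layer receives the concatenation of all previous feature maps.

from keras.layers import Input, Conv2D, concatenate
from keras.models import Model

def dense_block(x, n_layers=4, growth_rate=12):
    features = [x]
    for _ in range(n_layers):
        # each new layer sees the concatenation of every previous output
        inp = concatenate(features) if len(features) > 1 else features[0]
        y = Conv2D(growth_rate, (3, 3), padding='same', activation='relu')(inp)
        features.append(y)
    return concatenate(features)

inputs = Input(shape=(32, 32, 16))
outputs = dense_block(inputs)
Model(inputs, outputs).summary()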

Fully Convolutional Networks

(Source: http://cs231n.github.io/convolutional-networks/#convert)

The only difference between Fully Connected (FC) and Convolutional (CONV) layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters.

However, the neurons in both layers still compute dot products, so their functional form is identical.

Then, it is easy to see that for any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except at certain blocks (due to local connectivity), where the weights in many of the blocks are equal (due to parameter sharing).

Conversely, any FC layer can be converted to a CONV layer.

Let $F$ be the receptive field size of the CONV layer neurons, $S$ the stride with which they are applied, $P$ the amount of zero padding used on the border, and $K$ the depth (number of filters) of the CONV layer.

For example, an FC layer with $K=4096$ that is looking at some input volume of size $7×7×512$ can be equivalently expressed as a CONV layer with $F=7,P=0,S=1,K=4096$.

In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be 1×1×4096, since only a single depth column “fits” across the input volume, giving an identical result to the initial FC layer.

This can be very useful!

Consider a ConvNet architecture that takes a $224x224x3$ image and then uses a series of CONV and POOL layers to reduce the image to an activation volume of size $7x7x512$. From there, two FC layers of size $4096$ follow, and finally the last FC layer with $1000$ neurons that computes the class scores. We can convert each of these three FC layers to CONV layers as described above:

  • Replace the first FC layer that looks at $[7x7x512]$ volume with a CONV layer that uses filter size $F=7$, giving output volume $[1x1x4096]$.
  • Replace the second FC layer with a CONV layer that uses filter size $F=1$, giving output volume $[1x1x4096]$.
  • Replace the last FC layer similarly, with $F=1$, giving final output $[1x1x1000]$.

It turns out that this conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.

For example, if a $224x224$ image gives a volume of size $[7x7x512]$ - i.e. a reduction by 32 - then forwarding an image of size $384x384$ through the converted architecture would give the equivalent volume of size $[12x12x512]$, since $384/32 = 12$. Following through with the next 3 CONV layers that we just converted from FC layers now gives a final volume of size $[6x6x1000]$, since $(12 - 7)/1 + 1 = 6$. Note that instead of a single vector of class scores of size $[1x1x1000]$, we’re now getting an entire $6x6$ array of class scores across the $384x384$ image.

Evaluating the original ConvNet (with FC layers) independently across $224x224$ crops of the $384x384$ image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time. Forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation.
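A minimal sketch of the conversion in Keras (our own illustration, with the layer sizes of the example above): expressing the three FC layers as convolutions lets the same head run on larger feature maps and return a grid of class scores.

from keras.layers import Input, Conv2D
from keras.models import Model

# Features produced by the convolutional trunk: 7x7x512 for a 224x224 image,
# 12x12x512 for a 384x384 image (spatial size left as None to accept both).
features = Input(shape=(None, None, 512))

y = Conv2D(4096, (7, 7), activation='relu')(features)    # was FC-4096
y = Conv2D(4096, (1, 1), activation='relu')(y)            # was FC-4096
scores = Conv2D(1000, (1, 1), activation='softmax')(y)    # was FC-1000

fully_conv_head = Model(features, scores)
# on 7x7x512 features this gives 1x1x1000; on 12x12x512 it gives 6x6x1000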

Object Detection and Segmentation

(Source: https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4)

In classification, there’s generally an image with a single object as the focus, and the task is to say what that image is. But when we look at the world around us, we see complicated scenes with multiple overlapping objects and different backgrounds, and we not only classify these different objects but also identify their boundaries, differences, and relations to one another.

To what extent do CNNs generalize to object detection? Object detection is the task of finding the different objects in an image and classifying them.

R-CNN

A team composed of Ross Girshick (a name we’ll see again), Jeff Donahue, and Trevor Darrell found that this problem can be solved with AlexNet by testing on the PASCAL VOC Challenge, a popular object detection challenge akin to ImageNet.

The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are (via bounding boxes).

Inputs: Image

Outputs: Bounding boxes + labels for each object in the image.

But how do we find out where these bounding boxes are? R-CNN proposes a bunch of boxes in the image and checks whether any of them actually correspond to an object.

R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search (see http://www.cs.cornell.edu/courses/cs7670/2014sp/slides/VisionSeminar14.pdf).

At a high level, Selective Search looks at the image through windows of different sizes and, for each size, tries to group together adjacent pixels by texture, color, or intensity to identify objects.

Once the proposals are created, R-CNN warps each region to a standard square size and passes it through a modified version of AlexNet.

On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether the region contains an object, and if so, which object.

Now, having found the object in the box, can we tighten the box to fit the true dimensions of the object? We can, and this is the final step of R-CNN. R-CNN runs a simple linear regression on the region proposal to generate tighter bounding box coordinates to get our final result. Here are the inputs and outputs of this regression model:

Inputs: sub-regions of the image corresponding to objects.

Outputs: New bounding box coordinates for the object in the sub-region.
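Concretely, the regression usually does not predict raw pixel coordinates but offsets of the proposal box, scaled by its size (a sketch of the parameterization commonly used in the R-CNN family; the helper below is our own illustration):

import numpy as np

def bbox_regression_targets(proposal, ground_truth):
    # boxes given as (center_x, center_y, width, height)
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    tx = (gx - px) / pw          # shift of the center, relative to proposal size
    ty = (gy - py) / ph
    tw = np.log(gw / pw)         # log-scale change of width and height
    th = np.log(gh / ph)
    return np.array([tx, ty, tw, th])

print(bbox_regression_targets((50, 60, 100, 80), (55, 58, 110, 90)))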

Fast R-CNN

R-CNN works really well, but it is quite slow, for a few simple reasons:

  • It requires a forward pass of the CNN (AlexNet) for every single region proposal for every single image (that’s around 2000 forward passes per image!).
  • It has to train three different models separately - the CNN to generate image features, the classifier that predicts the class, and the regression model to tighten the bounding boxes. This makes the pipeline extremely hard to train.

In 2015, Ross Girshick, the first author of R-CNN, solved both these problems, leading to Fast R-CNN.

For the forward pass of the CNN, Girshick realized that, for each image, many of the proposed regions invariably overlapped, causing the same CNN computation to be run again and again (~2000 times!). His insight was simple: why not run the CNN just once per image and then find a way to share that computation across the ~2000 proposals?

This is exactly what Fast R-CNN does, using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions: the CNN features for each region proposal are obtained by selecting the corresponding region of the CNN’s feature map, and the features in each region are then pooled (usually using max pooling). So all it takes is one pass over the original image, as opposed to ~2000!

(Source: Stanford’s CS231N slides by Fei-Fei Li, Andrej Karpathy, and Justin Johnson)
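A toy version of the pooling step itself (illustrative numpy only, ignoring the coordinate-rounding details of the real RoIPool layer): the region of the feature map is divided into a fixed grid and each cell is max-pooled, so every region ends up with the same output size.

import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    # roi = (row0, col0, row1, col1) in feature-map coordinates
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out = np.zeros(output_size)
    h_step = region.shape[0] / output_size[0]
    w_step = region.shape[1] / output_size[1]
    for i in range(output_size[0]):
        for j in range(output_size[1]):
            rows = slice(int(i * h_step), int(np.ceil((i + 1) * h_step)))
            cols = slice(int(j * w_step), int(np.ceil((j + 1) * w_step)))
            out[i, j] = region[rows, cols].max()
    return out

fmap = np.arange(64).reshape(8, 8)
print(roi_max_pool(fmap, (2, 2, 7, 6)))   # a 5x4 region pooled down to 2x2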

The second insight of Fast R-CNN is to jointly train the CNN, classifier, and bounding box regressor in a single model. Where earlier we had different models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor), Fast R-CNN instead used a single network to compute all three.

(Source: https://www.slideshare.net/simplyinsimple/detection-52781995)

Faster R-CNN

Even with all these advancements, there was still one remaining bottleneck in the Fast R-CNN pipeline: the region proposer. As we saw, the very first step in detecting the locations of objects is generating a bunch of potential bounding boxes, or regions of interest, to test. In Fast R-CNN, these proposals were created using Selective Search, a fairly slow step that turned out to be the bottleneck of the overall process.

In mid-2015, a team at Microsoft Research composed of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun found a way to make the region proposal step almost cost free through an architecture they (creatively) named Faster R-CNN.

The insight of Faster R-CNN was that region proposals depended on features of the image that were already calculated with the forward pass of the CNN (first step of classification). So why not reuse those same CNN results for region proposals instead of running a separate selective search algorithm?

(Source: https://arxiv.org/abs/1506.01497)

Here are the inputs and outputs of their model:

Inputs: Images (notice that external region proposals are no longer needed).

Outputs: Classifications and bounding box coordinates of objects in the images.
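Structurally, the Region Proposal Network that does this is small (a sketch of its shape only, assuming k anchor boxes per location; this is our own illustration, not the authors' code): a 3x3 convolution slides over the shared feature map, and two sibling 1x1 convolutions predict an objectness score and box adjustments for each anchor.

from keras.layers import Input, Conv2D
from keras.models import Model

k = 9  # anchors per spatial location (e.g. 3 scales x 3 aspect ratios)
shared_features = Input(shape=(None, None, 512))   # feature map from the backbone CNN

x = Conv2D(512, (3, 3), padding='same', activation='relu')(shared_features)
objectness = Conv2D(k, (1, 1), activation='sigmoid')(x)   # one object/background score per anchor
box_deltas = Conv2D(4 * k, (1, 1))(x)                      # 4 box-regression values per anchor

rpn = Model(shared_features, [objectness, box_deltas])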

Mask R-CNN

So far, we’ve seen how we’ve been able to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes.

Can we extend such techniques to go one step further and locate exact pixels of each object instead of just bounding boxes? This problem, known as image segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.

Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel level segmentation?

Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask saying whether or not a given pixel is part of an object. The added branch, as before, is just a Fully Convolutional Network on top of a CNN-based feature map. Here are its inputs and outputs:

Inputs: CNN Feature Map. Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).

(Source: https://arxiv.org/abs/1703.06870)
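A simplified sketch of what such a mask branch can look like (our own illustration, not the authors' code): a few convolutions over the pooled RoI features, an upsampling step, and a final 1x1 convolution with a sigmoid that emits one mask per class.

from keras.layers import Input, Conv2D, Conv2DTranspose
from keras.models import Model

num_classes = 80                              # e.g. the COCO classes
roi_features = Input(shape=(14, 14, 256))     # fixed-size features for one region

x = Conv2D(256, (3, 3), padding='same', activation='relu')(roi_features)
x = Conv2D(256, (3, 3), padding='same', activation='relu')(x)
x = Conv2DTranspose(256, (2, 2), strides=2, activation='relu')(x)   # upsample to 28x28
masks = Conv2D(num_classes, (1, 1), activation='sigmoid')(x)        # per-pixel, per-class probabilities

mask_head = Model(roi_features, masks)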

1D-Conv for text classification

IMDB movie reviews sentiment classification: a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
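For example, that kind of filtering is exposed directly by the Keras loader (a small sketch; the experiment below uses its own settings):

from keras.datasets import imdb

# keep the 10,000 most common words, but drop the 20 most common ones
(x_tr, y_tr), (x_te, y_te) = imdb.load_data(num_words=10000, skip_top=20)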

The seminal research paper on this subject was published by Yoon Kim in 2014. In this paper, Kim laid the foundations for modeling and processing text with convolutional neural networks for the purpose of sentiment analysis. He showed that with simple one-dimensional convolutional networks, one can build very simple models that quickly reach around 90% accuracy.



In [1]:
'''
This example demonstrates the use of Convolution1D for text classification.
'''

from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Conv1D, MaxPooling1D
from keras.datasets import imdb


# set parameters:
max_features = 5000
maxlen = 100
batch_size = 32
embedding_dims = 100
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 10

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), ' train sequences \n')
print(len(X_test), ' test sequences \n')

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(Dropout(0.25))

# we add a Conv1D, which will learn `filters`
# word group filters of size `kernel_size`:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use standard max pooling (halving the output of the previous layer):
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=2))


# We flatten the output of the conv layer,
# so that we can add a vanilla dense layer:
model.add(Flatten())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.25))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, y_test))


Using TensorFlow backend.
Loading data...
Downloading data from https://s3.amazonaws.com/text-datasets/imdb_full.pkl
65552384/65552540 [============================>.] - ETA: 0s25000  train sequences 

25000  test sequences 

Pad sequences (samples x time)
X_train shape: (25000, 100)
X_test shape: (25000, 100)
Build model...
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 [==============================] - 142s - loss: 0.4477 - acc: 0.7723 - val_loss: 0.3660 - val_acc: 0.8434
Epoch 2/10
25000/25000 [==============================] - 146s - loss: 0.3189 - acc: 0.8646 - val_loss: 0.4128 - val_acc: 0.8169
Epoch 3/10
25000/25000 [==============================] - 142s - loss: 0.2809 - acc: 0.8842 - val_loss: 0.3468 - val_acc: 0.8540
Epoch 4/10
25000/25000 [==============================] - 143s - loss: 0.2554 - acc: 0.8992 - val_loss: 0.4648 - val_acc: 0.8002
Epoch 5/10
25000/25000 [==============================] - 144s - loss: 0.2279 - acc: 0.9114 - val_loss: 0.3516 - val_acc: 0.8515
Epoch 6/10
25000/25000 [==============================] - 141s - loss: 0.2039 - acc: 0.9222 - val_loss: 0.3769 - val_acc: 0.8538
Epoch 7/10
25000/25000 [==============================] - 140s - loss: 0.1791 - acc: 0.9316 - val_loss: 0.6110 - val_acc: 0.7828
Epoch 8/10
25000/25000 [==============================] - 141s - loss: 0.1600 - acc: 0.9417 - val_loss: 0.4374 - val_acc: 0.8511
Epoch 9/10
25000/25000 [==============================] - 145s - loss: 0.1430 - acc: 0.9480 - val_loss: 0.4554 - val_acc: 0.8472
Epoch 10/10
25000/25000 [==============================] - 141s - loss: 0.1264 - acc: 0.9565 - val_loss: 0.7149 - val_acc: 0.7900
Out[1]:
<keras.callbacks.History at 0x7f93d658cb50>